\( \newcommand{\water}{{\rm H_{2}O}} \newcommand{\R}{\mathbb{R}} \newcommand{\N}{\mathbb{N}} \newcommand{\Z}{\mathbb{Z}} \newcommand{\Q}{\mathbb{Q}} \newcommand{\E}{\mathbb{E}} \newcommand{\d}{\mathop{}\!\mathrm{d}} \newcommand{\grad}{\nabla} \newcommand{\T}{^\text{T}} \newcommand{\mathbbone}{\unicode{x1D7D9}} \renewcommand{\:}{\enspace} \DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator{\Tr}{Tr} \newcommand{\norm}[1]{\lVert #1\rVert} \newcommand{\KL}[2]{ \text{KL}\left(\left.\rule{0pt}{10pt} #1 \; \right\| \; #2 \right) } \newcommand{\slashfrac}[2]{\left.#1\middle/#2\right.} \)

Monte Carlo approximation: Compute an expectation with samples

We often have a (posterior) distribution \(\;p(\theta)\;\) under which the following expectation is analytically intractable, and we would like to compute it:

\[ \E_{p(\theta)}\!\left[ f(\theta) \right] \;=\; \int_\Theta f(\theta)\, p(\theta)\, \d\theta \]

Note that this is a weighted average, where \(\;p(\theta)\;\) assigns an infinitesimal weight \(\;p(\theta)\,\d\theta\;\) to each value \(f(\theta)\). Another way to assign these weights is to sample infinitely many times from the posterior distribution \(p(\theta)\); the "weight" of a given value of \(\theta\) is then determined by how often it is drawn.

\[ \int_\Theta f(\theta)\, p(\theta)\, \d\theta \;=\; \lim_{n \to \infty} \; \frac{1}{n} \sum_{i=1}^n f(\theta_i), \quad \quad \theta_i \sim p(\theta) \]

In practice we cannot draw infinitely many samples, but the same sum over a large enough number of samples gives an approximation of the expectation:

\[ \int_\Theta f(\theta)\, p(\theta)\, \d\theta \;\approx\; \frac{1}{n} \sum_{i=1}^n f(\theta_i), \quad \theta_i \sim p(\theta), \quad \text{for large } n \]
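As a concrete sketch of this estimator (assuming NumPy and an arbitrary example target: \(\theta \sim \mathcal{N}(\mu, \sigma^2)\) with \(f(\theta) = \theta^2\), so the true expectation \(\mu^2 + \sigma^2\) is known in closed form and we can watch the estimate converge):

```python
import numpy as np

rng = np.random.default_rng(0)

# Example setup (arbitrary choices for illustration):
# theta ~ N(mu, sigma^2) and f(theta) = theta^2, so the true
# expectation is E[f(theta)] = mu^2 + sigma^2 = 5.
mu, sigma = 1.0, 2.0
f = lambda theta: theta ** 2

for n in (10, 1_000, 100_000):
    theta = rng.normal(mu, sigma, size=n)  # theta_i ~ p(theta)
    estimate = f(theta).mean()             # (1/n) * sum_i f(theta_i)
    print(f"n = {n:>7}: estimate = {estimate:.4f} (true = 5.0)")
```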

NOTE We do not need to be able to evaluate the normalized density \(\; p(\theta) \;\). If we draw samples using only an unnormalized density \(\; \tilde{p}(\theta), \: p(\theta) = \frac{\tilde{p}(\theta)}{Z_p} \;\) (for example, with slice sampling, which never requires \(Z_p\)), the approximation gives exactly the same result as for the normalized distribution \(\; p(\theta) \;\). This is because what matters is the relative frequency of the different sample values, not the absolute scale of the density. For example, if you duplicated every sample, the estimate would be unchanged: the relative frequencies between sample values stay the same, and the \(1/n\) factor absorbs the doubling.
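A minimal sketch of this point, assuming NumPy and swapping in a random-walk Metropolis sampler (simpler to write down than the slice sampling mentioned above, but it likewise only ever evaluates \(\tilde{p}(\theta)\)): the normalizer \(Z_p\) cancels in the acceptance ratio, so rescaling the density by any positive constant leaves the samples, and hence the estimate, unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalized target: a positive rescaling of N(mu, sigma^2).
# The factor 1000 stands in for an unknown normalizer Z_p.
mu, sigma = 1.0, 2.0
tilde_p = lambda theta: 1000.0 * np.exp(-0.5 * ((theta - mu) / sigma) ** 2)

def metropolis(tilde_p, n, step=2.0, theta0=0.0):
    """Random-walk Metropolis: evaluates only the unnormalized density."""
    samples = np.empty(n)
    theta = theta0
    for i in range(n):
        proposal = theta + step * rng.normal()
        # Acceptance ratio tilde_p(proposal) / tilde_p(theta): Z_p cancels,
        # so the normalization of the density never matters.
        if rng.uniform() < tilde_p(proposal) / tilde_p(theta):
            theta = proposal
        samples[i] = theta
    return samples

theta = metropolis(tilde_p, n=100_000)  # burn-in omitted for brevity
print("estimate of E[theta^2]:", (theta ** 2).mean())  # true value: 5.0
```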